Analyzing Spatial Data from the Ford GoBike Bike Sharing Service

by Lucas Valerio de Oliveira

Introduction

According to Wikipedia, Ford GoBike is a public bike sharing system in California's San Francisco Bay Area. Now known as Bay Wheels, it was the first regional, large-scale bike sharing system deployed in California and on the West Coast of the United States. It was established as Bay Area Bike Share in August 2013. As of January 2018, the system had more than 2,600 bikes at 262 stations in San Francisco, the East Bay and San Jose.

In this study I analyze the trip data published by the bike sharing program for February 2019. The data is first examined through an exploratory analysis, followed by an explanatory analysis.

Preliminary Wrangling

Gathering and Assessing Data

What is the structure of your dataset?

The dataset has 183,412 rows and 16 columns. Some columns contain null values, which need to be examined to decide whether to treat or remove them. Regarding data types, the date and time variables need to be converted to a DateTime type, and the fields that hold an ID need to be converted to strings. Finally, the birth year variable should be analyzed, since some individuals have a recorded birth year of 1878. All records are from February 2019.

What is/are the main feature(s) of interest in your dataset?

What interested me most was finding out how the data is distributed spatially and then building a map of that distribution. The latitude and longitude columns can be used for that.

In addition, I will try to find out which factors influence trip duration: date and time, user age, start and end stations, and user gender.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The first part of my analysis evaluates the latitude and longitude data in relation to trip duration, age, gender and user type, to see how these characteristics are distributed across the map. For trip duration, I mainly need the start time, the station information and the user characteristics.

Cleaning

In this section we adjust some fields and improve the quality of the data.


Fix all wrong types and remove null birth and gender data

As stated earlier, we need to adjust the data types to ensure better data quality. The date and time columns are converted to datetime, the ID columns to strings, and gender to a categorical type.
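As a minimal sketch of these conversions, assuming the column names of the public Ford GoBike trip export (`start_time`, `start_station_id`, `member_gender` and so on, which may differ from the file actually used), the step could look like this. The helper name `fix_types` is hypothetical.

```python
import pandas as pd

# Tiny stand-in for the trip table; column names are assumed, not confirmed.
trips = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 18:53:21.7890"],
    "end_time": ["2019-03-01 08:01:55.9750", "2019-02-28 19:10:02.1000"],
    "start_station_id": [21.0, 23.0],
    "bike_id": [4902, 2535],
    "member_gender": ["Male", None],
    "member_birth_year": [1984.0, None],
})

def fix_types(df):
    """Return a copy with datetime, string-ID and categorical columns."""
    # Drop rows without gender or birth year before casting.
    out = df.dropna(subset=["member_gender", "member_birth_year"]).copy()
    for col in ["start_time", "end_time"]:
        out[col] = pd.to_datetime(out[col])
    for col in ["start_station_id", "bike_id"]:
        # Go through nullable Int64 so 21.0 becomes "21" and missing IDs stay missing.
        out[col] = out[col].astype("Int64").astype("string")
    out["member_gender"] = out["member_gender"].astype("category")
    out["member_birth_year"] = out["member_birth_year"].astype("int64")
    return out

clean = fix_types(trips)
print(clean.dtypes)
```

Casting the IDs through the nullable integer type is a design choice for this sketch: it avoids string IDs such as "21.0" and tolerates station IDs that are still missing at this stage.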

Test


Check and adjust Birthday Year values

We will remove the rows with a null birth year. Dropping them will not interfere with the study, whereas filling them in incorrectly could skew the final results. The age distribution is then evaluated and, finally, the outliers are removed.
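The text does not say which outlier rule was applied, so the following is only one possible sketch: it drops missing birth years and then removes values below the lower Tukey fence (1.5 times the interquartile range under the first quartile). The function name, the column name `member_birth_year` and the factor `k` are assumptions for illustration.

```python
import pandas as pd

def drop_birth_year_outliers(df, col="member_birth_year", k=1.5):
    """Drop rows with a missing birth year or one below the lower Tukey fence."""
    out = df.dropna(subset=[col])
    q1, q3 = out[col].quantile([0.25, 0.75])
    lower = q1 - k * (q3 - q1)  # implausibly old riders fall below this fence
    return out[out[col] >= lower]

# Synthetic sample that includes the suspicious 1878 value and one null.
sample = pd.DataFrame(
    {"member_birth_year": [1878, 1975, 1984, 1988, 1990, 1993, 1995, None]}
)
kept = drop_birth_year_outliers(sample)
print(kept["member_birth_year"].min())
```

Only the lower fence is applied here because an implausibly early birth year is the problem described; recent birth years are left untouched.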


Evaluate the data of the stations

We will now plot the stations on the map and evaluate the null data that was found.

Test

Analyze null data from bike stations

In this section we investigate the integrity of the longitude and latitude variables and how they are distributed on the map.

First I plot all stations on the map and look at the records that have no station ID or name but do have latitude and longitude.

The previous visualization shows that the records with null station data fall in the same area as the other records, so removing them will not affect the study: there is no offset between the areas with null and non-null names. I therefore choose to remove these records. Each station corresponds to a street name, and recovering the missing values would require tidying the data by hand, which would be very laborious. If these records were relevant to the study, however, it would be important to recover them, since they could carry useful information about the behavior of that area.

These clusters suggest that the latitude and longitude pairs are similar within groups, so the data can be grouped to define the regions we are studying.

How are the stations distributed? What are the regions of the study?

To answer these questions we look at the relationship between latitude and longitude, and then check whether the coordinate pairs form groups of similar points.

Since the data is separated by region, I will create a function that obtains these groupings through clustering and assigns a centroid to each of them, which I will call a macro region. I will then increase the number of clusters to obtain the micro regions within the data.
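A minimal sketch of such a function, assuming scikit-learn's `KMeans` on raw latitude and longitude pairs. The function name `assign_regions`, the synthetic coordinates and the micro-region count of 6 are illustrative assumptions, not values taken from the study.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_regions(coords, n_clusters, seed=0):
    """Cluster (lat, lon) pairs and return one label per point plus the centroids."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(coords)
    return labels, model.cluster_centers_

# Synthetic stations scattered around three well-separated centers.
rng = np.random.default_rng(0)
centers = np.array([[37.77, -122.42], [37.80, -122.27], [37.33, -121.89]])
coords = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centers])

# Few clusters give the macro regions; more clusters give the micro regions.
macro_labels, macro_centroids = assign_regions(coords, n_clusters=3)
micro_labels, micro_centroids = assign_regions(coords, n_clusters=6)
print(macro_centroids)
```

K-means uses Euclidean distance, so treating degrees of latitude and longitude as plane coordinates is an approximation. It is reasonable over an area this small, but it is not a true geographic distance.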

In parallel, I want to find the names of the regions, since the clustering method only gives us a centroid. For this I built a function based on the GeoPy Python library, which looks up the information in an external repository.
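One way such a lookup could be written is sketched here, assuming GeoPy's `Nominatim` geocoder and the address keys it usually returns (`city`, `town`, `county`, `state`). The function names are hypothetical, and the demo uses an offline stand-in so that it runs without network access.

```python
def region_name(lat, lon, reverse):
    """Return a place name for a centroid using a reverse-geocoding callable.

    `reverse` takes a (lat, lon) tuple and returns an object with a `.raw`
    dict, as geopy geocoders do, or None when nothing is found.
    """
    location = reverse((lat, lon))
    if location is None:
        return None
    address = location.raw.get("address", {})
    # Prefer the most specific name available (assumed Nominatim keys).
    for key in ("city", "town", "county", "state"):
        if key in address:
            return address[key]
    return None

def nominatim_reverse(user_agent="bike-share-regions-example"):
    """Build a real geopy reverse function (requires geopy and network access)."""
    from geopy.geocoders import Nominatim  # lazy import: optional dependency
    return Nominatim(user_agent=user_agent).reverse

class FakeLocation:
    """Offline stand-in for a geopy Location, used only in this demo."""
    def __init__(self, raw):
        self.raw = raw

def fake_reverse(point):
    return FakeLocation({"address": {"city": "San Francisco", "state": "California"}})

name = region_name(37.77, -122.42, fake_reverse)
print(name)
```

With network access, `region_name(lat, lon, nominatim_reverse())` would query the live service instead. Passing the geocoder in as an argument keeps the naming logic testable offline.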

An example is run to obtain the macro regions and micro regions. These functions can be used to attach new information to the centroids and enrich the study data.

The following are the macro and micro regions defined by the K-means clustering method:

With the data properly adjusted, the next section evaluates each variable that matters for the spatial analysis and for the trips made.

Univariate Exploration

Regarding time variables

We start by evaluating the distribution of trips over time, analyzed by hour, by day of the week, by week, and by day of the month.
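The four time groupings can be derived from the trip start timestamp with the pandas `.dt` accessor. This is a small sketch on made-up timestamps; the output column names are chosen only for illustration.

```python
import pandas as pd

# Three made-up start times within February 2019.
starts = pd.to_datetime(pd.Series([
    "2019-02-01 08:15:00",
    "2019-02-02 17:40:00",
    "2019-02-28 23:05:00",
]))

features = pd.DataFrame({
    "hour": starts.dt.hour,                  # hour of the day, 0 to 23
    "weekday": starts.dt.day_name(),         # Monday ... Sunday
    "week": starts.dt.isocalendar().week,    # ISO week number
    "day": starts.dt.day,                    # day of the month
})
print(features)
```

Counting rows per value of each column, for example with `features["hour"].value_counts().sort_index()`, then gives the distributions discussed in this section.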

How are trips distributed over the hours of the day?

Observation:

The graphs above show higher demand for the service around 8 am and around 5 pm. This matches the times when people leave home for work and appointments and when they return. Demand is lowest in the early hours of the morning.

How are trips distributed over the days of the week?

Observation:

Regarding use during the week, demand is higher from Monday to Friday, with Thursday being the busiest day. On weekends, demand for the service drops by almost half.

How are trips distributed over the days of the month?

Observation:

Over the month, the pattern of higher use on weekdays and lower use on weekends is visible. An important point to investigate is how trips on weekends differ from those on weekdays in terms of duration and distance traveled. We can also check whether the age of the people who use the service during the week differs from that of weekend users.

Regarding user variables

What is the distribution of genders across the data from bike trips?

Observation:

The program is clearly used more by men: there are about 80,000 more male than female users in the data.

How is the distribution of users' birth dates?

Observation:

The distribution of birth years leans to the right, with the highest frequencies around 1990. The earliest birth year is 1940, which shows that there are users in all age groups, with a greater concentration among young people and adults and fewer among the elderly.

How often are users subscribed to the bike sharing program?

Observation:

The number of subscribers is approximately 8 times greater than the number of customers. This indicates a high enrollment rate among people who use the bikes.

What are the most used stations? Is there a segregation of regions in terms of the distribution of latitudes and longitudes?

Observation

A quick analysis of origins and destinations shows that the most common start station is Market St at 10th St and the most common end station is San Francisco Caltrain Station 2. This suggests that there is a pattern in how people move through the region.

In addition, the histograms reinforce the idea that trips are made within specific groups of the program's region, so we can group them using clustering techniques. This would enrich our data about the regions of the program, since there are no trips between the large groups. With origin and destination data plus user information, we could study the urban mobility pattern of the region.

What is the distribution of travel times?

Observation

The plotted trip duration data had to be adjusted because of a very high value of 84,548 seconds. We can therefore disregard trips longer than about 1 hour (3,600 seconds), because they lie very far from the general distribution of trip durations.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The variables that caught my attention most were latitude and longitude, since their distributions seem to be concentrated in fixed regions, as seen in the graph in the data cleaning section. I will look at how they are related and try to group them to generate more information about the study regions. It was not necessary to transform the latitude and longitude data.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The trip duration variable showed a very unusual distribution. I adjusted the data based on the 99th percentile and found that only one record had a discrepant duration. I chose to remove it and then generated a new distribution chart that looks much more realistic.
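A small sketch of this kind of check, on synthetic durations with a single extreme value like the 84,548 seconds mentioned earlier. The one-hour cap follows the text; the sample itself and the variable names are made up.

```python
import pandas as pd

# 199 ordinary durations (100 s to 2080 s) plus one extreme value.
durations = pd.Series(list(range(100, 2090, 10)) + [84548], name="duration_sec")

p99 = durations.quantile(0.99)   # where 99% of the trips end
longest = durations.max()        # the discrepant value
one_hour = 3600

# Keep only trips of at most one hour.
trimmed = durations[durations <= one_hour]
print(p99, longest, len(durations) - len(trimmed))
```

Comparing the 99th percentile with the maximum shows how isolated the extreme value is: here the percentile sits well under an hour while the maximum is close to a full day.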

Bivariate Exploration

Let's analyze the correlations and look for relationships in the data.

Which columns are correlated?

Observation

Latitude and longitude are strongly correlated, while age and duration are only partially related. We will therefore continue the study using the regions derived from the coordinates. Overall, the correlations reveal little that is new, but they support the spatial analysis approach, because they show a strong relationship between the coordinates.
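For reference, a pairwise correlation table of this kind can be computed with `DataFrame.corr`, which uses the Pearson coefficient by default. The values below are synthetic and the column names are assumed, so the numbers only illustrate the mechanics, not the study's results.

```python
import pandas as pd

# Synthetic rows: stations to the south-east have lower latitude and higher longitude.
numeric = pd.DataFrame({
    "start_station_latitude": [37.77, 37.80, 37.33, 37.78, 37.34],
    "start_station_longitude": [-122.42, -122.27, -121.89, -122.40, -121.90],
    "duration_sec": [520, 610, 430, 980, 350],
    "member_birth_year": [1988, 1975, 1992, 1995, 1984],
})

corr = numeric.corr()  # Pearson correlation between every pair of columns
lat_lon = corr.loc["start_station_latitude", "start_station_longitude"]
print(corr.round(2))
```

In this sample the latitude and longitude correlation is strongly negative, which is what a set of stations laid out along a north-west to south-east axis produces.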

The results above show that trips between macro regions almost never happen. Between micro regions, however, especially those in the macro regions at the bottom of the chart, many trips start in one micro region and end in another.
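One way to quantify movement between regions is an origin by destination table. The sketch below assumes each trip already carries a start and an end region label (hypothetical columns `start_macro` and `end_macro`) and uses invented rows.

```python
import pandas as pd

# Invented trips labelled with the macro region of their start and end stations.
trips = pd.DataFrame({
    "start_macro": [0, 0, 0, 1, 1, 2, 2, 2],
    "end_macro":   [0, 0, 0, 1, 1, 2, 2, 1],
})

# Rows are origins, columns are destinations, cells are trip counts.
flows = pd.crosstab(trips["start_macro"], trips["end_macro"])

# Share of trips that end in a different region than they started in.
cross_share = (trips["start_macro"] != trips["end_macro"]).mean()
print(flows)
print(cross_share)
```

Trips that stay inside a region land on the diagonal of the table, so any off-diagonal count marks movement between regions. The same table built from micro-region labels shows the finer-grained flows.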

Evaluating data against macro regions

How does birth year vary across the macro regions?

Observation

We can see from this graph that the people in macro region 1 have birth years around 1995, so the members of this region are generally younger than the members of the other macro regions, whose birth years are more spread out. The members of macro region 1 have birth years distributed around 1985, while the members of macro region 2 have more frequent births in 1988 and later a greater

How does the year of birth vary with travel times?

Observation

The data shows a higher concentration of trips for members born between 1980 and 2000, indicated by the stronger color. In addition, longer trips become less frequent as member age increases. In general, younger members make longer trips, but most of their trips are concentrated around 500 seconds, a little over 8 minutes. This is an interesting result for evaluating how each area behaves in relation to the overall picture.

How do average distances relate to macro regions?

Observation:

Region 0 has the longest average trip duration. This can be used to characterize the regions of the study and to understand the spatial relationship behind this data.

How is gender related to the macro regions?

Observation

Here gender follows the same pattern as the overall data. There is a large gap between men and women.

How is user type related to the macro regions?

Observation

Here user type follows the same pattern as the overall data. There is a large gap between subscribers and customers.

What are the relationships between the temporal variables and the macro regions of the study?

Observation

Across the hours of the day, days of the week and days of the month, macro region 0 has more trips than the others and follows the same pattern observed in the univariate analyses. The same pattern can be seen in the other regions.

Evaluating data for micro regions

How does birth year vary across the micro regions?

Observation

When we evaluate the micro regions, two cases stand out as very different from the others: region 1 and region 3. At the start of this section we observed that one micro region was essentially equal to a macro region, so given the similar behavior of micro region 1 and macro region 1, we can say that they contain the same data. Micro region 3, in turn, behaves similarly to macro region 2, which is where it is located.

What is the average travel time in micro regions?

Observation

Micro regions 2 and 4 have, on average, the longest duration of bike use, with values greater than or close to those of the macro regions that contain them. The shortest average duration is in region 1, which shares the same value as its macro region.

How are user types and genders distributed across the micro regions?

Observation

Here user type and gender follow the same pattern as the overall data. There is a large gap between subscribers and customers.

What are the relationships between micro-regions and people's data in relation to time variables?

Observation

Here the micro regions show the variation of the data in a little more detail than the macro view. This shows the value of enriching the data with information that can be read from the same perspective but yields quite different results.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

For the macro regions, we observed an important characteristic in the mean distances of the regions and in how the age distribution varies in each of them. The region with the most bike trips across all hours and days was region 0, followed by region 2 and finally region 1. This is interesting because the region with the most use of the service is also the one with the most users, while the region with the fewest users shows steady use over time.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. The micro regions vary considerably compared with the first approach using macro regions, and they explain the relationship between distances and routes better than the macro regions, since they capture movements between zones.

Multivariate Exploration

In this section I discuss, in a limited way, the relationship between the bike stations in terms of trip duration, departure and arrival times, and trip origin and destination.

Macro region 0

I start with macro region 0. First I analyze the departure and arrival characteristics of the region in relation to trip duration, grouping the data by the stations of the region.
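A grouping of this kind could be sketched as follows, assuming a region label column and the usual trip columns (all names here are assumptions) and using invented rows. It filters one macro region and summarizes each start station by average duration and average start hour.

```python
import pandas as pd

# Invented trips; station names, region labels and values are placeholders.
trips = pd.DataFrame({
    "start_macro": [0, 0, 0, 0, 0, 1],
    "start_station_name": ["A", "A", "B", "B", "B", "C"],
    "start_time": pd.to_datetime([
        "2019-02-04 08:10", "2019-02-04 09:20", "2019-02-04 17:05",
        "2019-02-05 18:30", "2019-02-06 16:45", "2019-02-06 12:00",
    ]),
    "duration_sec": [300, 500, 1200, 900, 1500, 700],
})

region0 = trips[trips["start_macro"] == 0]
per_station = (
    region0.assign(start_hour=region0["start_time"].dt.hour)
    .groupby("start_station_name")
    .agg(
        mean_duration=("duration_sec", "mean"),
        mean_start_hour=("start_hour", "mean"),
        n_trips=("duration_sec", "count"),
    )
)
print(per_station)
```

Joining this table back to the station coordinates would allow the per-station averages to be drawn on a map, and the same grouping on the end station gives the arrival side.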

Questions:

Observation:

The results show that region 0 has long trips on its periphery, as shown in the first graph, and short trips in its center. The second graph shows that the short trips in the center are common both in the morning and in the afternoon, while on the periphery use tends to fall in either the morning or the afternoon. This may reflect people who travel to the central region for their activities and return home at the end of the day.

Macro region 1

Questions:

Observation

Macro region 1 has some interesting characteristics. First, it is composed of 3 micro regions, with short trips in the center and longer trips farther away. Second, the latest average start time occurs in the central region, which reflects people returning home, while in the more remote areas the average start time is closer to early morning, indicating trips toward the center. Finally, the last graph shows trips between micro regions, with movement from the 3 zones toward the center of the data. That is, the most likely destination is the central region, and trips toward the periphery are less common.

Macro region 2

Questions:

Observation

Region 2 shows the same trip and time behavior as the others. A curious fact is that, when we evaluate the micro regions, some trips leave macro region 2 and head toward macro region 1. That is, here we find mobility between macro regions, which we did not observe in the previous cases.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The central areas of the spatial distributions analyzed have high volumes of user movement but shorter trip durations. People on the outskirts tend to make longer trips, and the destination is often the center. As for the temporal variable, use of the service in the center rises around 3 pm in all regions, indicating that at the end of the day people move from the center to other areas.

Were there any interesting or surprising interactions between features?

Among all the macro regions studied, only macro region 2 showed trips between macro regions. Moreover, it is very evident that trips between micro regions are linked to the central part of each macro region.

References

https://jakevdp.github.io/PythonDataScienceHandbook/04.13-geographic-data-with-basemap.html
https://geopy.readthedocs.io/en/stable/index.html?highlight=latitude#geopy.location.Location.latitude
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
https://plotly.com/python/scattermapbox/
https://www.geeksforgeeks.org/get-the-city-state-and-country-names-from-latitude-and-longitude-using-python/